feat: WOS structured author extraction with S2/MinerU merge fallback #14
Open
Caorui-Li wants to merge 1 commit into
Add Web of Science Starter API as the primary source for Phase 2 author extraction, with intelligent fallback and cross-source affiliation merging.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- `structured_author_fetcher.py`: WOS Starter API as primary Phase 2 author source, with intelligent fallback chain (WOS+PDF → WOS+S2 → S2+collector)
- `author_name_utils.py`: robust name normalization and 5-rule fuzzy matching across WOS/S2/PDF formats (handles initials, accents, inverted order)
- `ConfigUpdate` missing `wos_api_key` so key was silently discarded
- … `5.710`, `5 710`)
- `_author_debug.json` output for testing
AI coding brief
Original request: Integrate the WOS Starter API as the structured author extraction source for Phase 2. WOS gives accurate author lists but no affiliations; affiliations come from the PDF (MinerU) when available, otherwise from S2. When WOS is not configured, behavior should be identical to before.
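The merge described in the request can be sketched roughly as follows. This is a minimal illustration, not the code in `structured_author_fetcher.py` or `author_name_utils.py`: the `Author` dataclass, `normalize_name`, `names_match`, and `merge_authors` are hypothetical names, the matching uses a single simplified rule (same surname, compatible first given name or initial) rather than the PR's five rules, and the collector step of the final fallback is not modeled. The affiliations in the usage lines are made-up placeholders.

```python
from __future__ import annotations

import unicodedata
from dataclasses import dataclass, field


@dataclass
class Author:
    # Hypothetical record type; the PR does not show its data model.
    name: str
    affiliations: list[str] = field(default_factory=list)


def normalize_name(name: str) -> tuple[str, str]:
    """Return (surname, given), lowercased with accents and periods removed.

    Assumes "Surname, Given" when a comma is present, else "Given Surname".
    """
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = plain.replace(".", " ").lower().strip()
    if "," in plain:
        surname, _, given = plain.partition(",")
    else:
        parts = plain.split()
        surname = parts[-1] if parts else ""
        given = " ".join(parts[:-1])
    return " ".join(surname.split()), " ".join(given.split())


def names_match(a: str, b: str) -> bool:
    """Simplified rule: equal surnames and a compatible first given name."""
    sur_a, given_a = normalize_name(a)
    sur_b, given_b = normalize_name(b)
    if not sur_a or sur_a != sur_b or not given_a or not given_b:
        return False
    first_a, first_b = given_a.split()[0], given_b.split()[0]
    if len(first_a) == 1 or len(first_b) == 1:
        # One side is an initial: compare first letters only.
        return first_a[0] == first_b[0]
    return first_a == first_b


def merge_authors(wos_names, pdf_authors, s2_authors):
    """Pick a source combination and attach affiliations to WOS names."""
    if not wos_names:
        # WOS not configured or empty: use S2 unchanged (collector not modeled).
        return "s2", list(s2_authors)
    if pdf_authors:
        donors, label = pdf_authors, "wos+pdf"
    else:
        donors, label = s2_authors, "wos+s2"
    merged = []
    for wos_name in wos_names:
        match = next((d for d in donors if names_match(wos_name, d.name)), None)
        merged.append(Author(wos_name, list(match.affiliations) if match else []))
    return label, merged


# Usage with placeholder data
s2 = [Author("Jürgen Muller", ["TU Example"]), Author("Wei Chen", ["Example Univ"])]
label, authors = merge_authors(["Müller, J.", "Chen, Wei"], None, s2)
```

Keeping the WOS name as the canonical spelling and borrowing only affiliations from the other source mirrors the stated division of labor: WOS for the author list, PDF or S2 for affiliations. An unmatched WOS author keeps an empty affiliation list rather than guessing.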
Manual interventions:
- `wos_api_key` missing from `ConfigUpdate` in `main.py`: key was silently discarded on UI save
- … `task_executor._run_new_phase2_and_3`, not `author_searcher.py`
Retro: Specify upfront which code path is the production pipeline vs legacy/dead code. Also define multi-source merge priority rules before implementation to avoid rework.
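The `wos_api_key` problem is a common failure mode for update schemas: a key that the schema does not declare is dropped without any error. The PR does not show how `ConfigUpdate` is defined, so the sketch below is a stand-in built on a plain dataclass. `s2_api_key`, `apply_update`, and `dropped_keys` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ConfigUpdate:
    # Stand-in schema; only wos_api_key is named in the PR.
    s2_api_key: Optional[str] = None   # hypothetical existing field
    wos_api_key: Optional[str] = None  # the field that had been missing


KNOWN_KEYS = {f.name for f in fields(ConfigUpdate)}


def apply_update(config: dict, payload: dict) -> dict:
    """Merge only schema-declared keys into the config.

    Undeclared keys are dropped without raising, which is how a newly
    added setting can vanish on save when the schema is not updated.
    """
    accepted = {k: v for k, v in payload.items() if k in KNOWN_KEYS}
    return {**config, **accepted}


def dropped_keys(payload: dict) -> set:
    """Report keys that apply_update would discard, e.g. for a log warning."""
    return set(payload) - KNOWN_KEYS


saved = apply_update({}, {"wos_api_key": "demo-key", "unknown_setting": 1})
```

Logging the result of a check like `dropped_keys` on every save would turn this kind of silent discard into a visible warning.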
Test plan
- `config.json` persists key
- `📋 WOS 结构化作者提取已启用(WOS + S2 双源融合)` ("WOS structured author extraction enabled (WOS + S2 dual-source merge)")
- `wos+pdf`, affiliations from PDF
- `wos+s2`, affiliations from S2 where name-matched
- `⚪ 未配置 WOS Key` ("WOS key not configured"), behavior identical to before
- `_author_debug.json` in result folder for per-paper breakdown
- … `5.710`) parsed correctly
🤖 Generated with Claude Code